class: center, middle, inverse, title-slide # Adding to the story: Annotations, maps and interactions ### Abhijit Dasgupta, PhD ### Spring 2019 --- layout: true <div class="my-header"> <span>BIOF 439, Spring 2019</span></div> --- class:middle, center, inverse # Annotations --- ## Stand-alone stories - You would like a data visualization to stand on its own - Relevant information should be placed on the graph - However, you need to balance the information content with real estate - Don't clutter the graph and make it not readable --- ## An example <div id="origecon"></div>  --- ## Reconstructing this annotated graph .pull-left[ ```r library(tidyverse) econ_data <- rio::import('data/EconomistData.csv') ggplot(econ_data, aes(x = CPI, y = HDI, color=Region))+ geom_point() ``` ] .pull-right[ <!-- --> ] --- ## Reconstructing this annotated graph .pull-left[ ```r ggplot(econ_data, aes(x = CPI, y = HDI, color=Region))+ geom_point() + geom_smooth(color='red', se=F) ``` #### Add a trend line ] .pull-right[ <!-- --> ] --- ## Reconstructing this annotated graph .pull-left[ ```r ggplot(econ_data, aes(x = CPI, y = HDI, color=Region))+ * geom_smooth(color='red', se=F) + * geom_point() ``` #### Reverse order so points are above line ] .pull-right[ <!-- --> ] --- ## Reconstructing this annotated graph .pull-left[ ```r ggplot(econ_data, aes(x = CPI, y = HDI, color=Region))+ geom_smooth(color='red', se=F) + * geom_point(shape = 1, size = 4, stroke=1.25) ``` #### Different shape for points ] .pull-right[ <!-- --> ] --- ## Reconstructing this annotated graph .pull-left[ ```r pointsToLabel <- c("Russia", "Venezuela", "Iraq", "Myanmar", "Sudan", "Afghanistan", "Congo", "Greece", "Argentina", "Brazil", "India", "Italy", "China", "South Africa", "Spane", "Botswana", "Cape Verde", "Bhutan", "Rwanda", "France", "United States", "Germany", "Britain", "Barbados", "Norway", "Japan", "New Zealand", "Singapore") ggplot(econ_data, aes(x = CPI, y = HDI, color=Region))+ geom_smooth(color='red', se=F) + geom_point(shape = 1, size = 4, stroke=1.25) + geom_text(aes(label=Country), color = 'gray20') ``` #### Label countries ] .pull-right[ <!-- --> ] --- ## Reconstructing this annotated graph .pull-left[ ```r pointsToLabel <- c("Russia", "Venezuela", "Iraq", "Myanmar", "Sudan", "Afghanistan", "Congo", "Greece", "Argentina", "Brazil", "India", "Italy", "China", "South Africa", "Spane", "Botswana", "Cape Verde", "Bhutan", "Rwanda", "France", "United States", "Germany", "Britain", "Barbados", "Norway", "Japan", "New Zealand", "Singapore") ggplot(econ_data, aes(x = CPI, y = HDI, color=Region))+ geom_smooth(color='red', se=F) + geom_point(shape = 1, size = 4, stroke=1.25) + geom_text(aes(label=Country), color = 'gray20', * data = econ_data %>% * filter(Country %in% pointsToLabel)) ``` #### Better, but labels are overlayed on points ] .pull-right[ <!-- --> ] --- ## Reconstructing this annotated graph .pull-left[ ```r *library(ggrepel) pointsToLabel <- c("Russia", "Venezuela", "Iraq", "Myanmar", "Sudan", "Afghanistan", "Congo", "Greece", "Argentina", "Brazil", "India", "Italy", "China", "South Africa", "Spane", "Botswana", "Cape Verde", "Bhutan", "Rwanda", "France", "United States", "Germany", "Britain", "Barbados", "Norway", "Japan", "New Zealand", "Singapore") (plt <- ggplot(econ_data, aes(x = CPI, y = HDI, color=Region))+ geom_smooth(color='red', se=F) + geom_point(shape = 1, size = 4, stroke=1.25) + * geom_text_repel(aes(label=Country), color = 'gray20', force=20, data = econ_data %>% filter(Country %in% pointsToLabel))) ``` ] .pull-right[ <!-- --> ] --- ## Reconstructing this annotated graph .pull-left[ Let's re-order the regions ```r econ_data$Region <- factor(econ_data$Region, levels = c("EU W. Europe", "Americas", "Asia Pacific", "East EU Cemt Asia", "MENA", "SSA"), labels = c("OECD", "Americas", "Asia &\nOceania", "Central &\nEastern Europe", "Middle East &\nnorth Africa", "Sub-Saharan\nAfrica")) *plt$data = econ_data *plt ``` ] .pull-right[ <!-- --> ] --- ## Reconstructing this annotated graph .pull-left[ Clean up the graphic ```r (plt_corrupt <- plt + scale_x_continuous(name = 'Corruption Perceptions Index', breaks = 1:10) + scale_y_continuous(name="Human Development Index", breaks = seq(0.2, 1, by = 0.1))+ scale_color_manual(name = '', values = c("#24576D", "#099DD7", "#28AADC", "#248E84", "#F2583F", "#96503F")) + ggtitle("Corruption and Human development")+ theme_bw()+ theme(legend.position='top', legend.direction='horizontal') ) ``` ] .pull-right[ <!-- --> ] --- class:middle,center # Adding derived statistics to a plot --- ## Adding group means .pull-left[ ```r ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species))+ geom_point()+ theme_bw() ``` ] .pull-right[ <!-- --> ] --- ## Adding group means .pull-left[ ```r means <- iris %>% group_by(Species) %>% summarize_at(vars(starts_with('Sepal')), mean) means ``` ``` #> # A tibble: 3 x 3 #> Species Sepal.Length Sepal.Width #> <fct> <dbl> <dbl> #> 1 setosa 5.01 3.43 #> 2 versicolor 5.94 2.77 #> 3 virginica 6.59 2.97 ``` ```r ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species))+ geom_point() + * geom_point(data = means, * size=5) + theme_bw() ``` ] .pull-right[ <!-- --> ] --- ## Adding regression metrics .pull-left[ Regress highway mileage on city mileage (data: mpg) ```r mod1 <- lm(hwy ~ cty, data = mpg) r2 <- broom::glance(mod1) %>% pull(r.squared) ggplot(mpg, aes(x = cty, y = hwy))+ geom_point() + geom_smooth(method = 'lm', se=F) + theme_bw() ``` ] .pull-right[ <!-- --> ] --- ## Adding regression metrics .pull-left[ Regress highway mileage on city mileage (data: mpg) ```r mod1 <- lm(hwy ~ cty, data = mpg) r2 <- broom::glance(mod1) %>% pull(r.squared) %>% round(., 2) ggplot(mpg, aes(x = cty, y = hwy))+ geom_point() + geom_smooth(method = 'lm', se=F)+ annotate(geom='text', x = 15, y = 40, label=glue::glue("R^2 == {r}",r=r2), size=12, parse=T) + theme_bw() ``` ] .pull-right[ <!-- --> ] --- ## Highlighting regions .pull-left[ ```r mpg %>% mutate(cyl = as.factor(cyl)) %>% ggplot(aes(x = cyl, y = hwy)) + geom_boxplot() + theme_bw() ``` ] .pull-right[ <!-- --> ] --- ## Highlighting regions .pull-left[ ```r mpg %>% mutate(cyl = as.factor(cyl)) %>% ggplot(aes(x = cyl, y = hwy)) + geom_boxplot() + theme_bw()+ annotate(geom = 'rect', xmin=3.75,xmax=4.25, ymin = 22, ymax = 28, fill = 'red', alpha = 0.2) + annotate('text', x = 4.5, y = 25, label = 'Outliers?', hjust = 0)+ coord_cartesian(xlim = c(0,5))+ theme_bw() ``` ] .pull-right[ <!-- --> ] --- class:center, middle,inverse # Maps --- For maps, we need a couple of new packages. - `sf`: Simple features in R - `rnaturalearth` & `rnaturalearthdata`: map data --- .pull-left[ ```r library(sf) library(rnaturalearth) library(rnaturalearthdata) world <- ne_countries(scale='medium', returnclass='sf') ggplot(data = world) + geom_sf() ``` ] .pull-right[ <!-- --> ] --- .pull-left[ ```r library(sf) library(rnaturalearth) library(rnaturalearthdata) world <- ne_countries(scale='medium', returnclass='sf') ggplot(data = world) + geom_sf(aes(fill = pop_est)) ``` ] .pull-right[ <!-- --> ] --- .pull-left[ ```r library(sf) library(rnaturalearth) library(rnaturalearthdata) world <- ne_countries(scale='medium', returnclass='sf') ggplot(data = world) + geom_sf(aes(fill = pop_est))+ scale_fill_viridis_c(option = 'plasma', trans='sqrt') ``` ] .pull-right[ <!-- --> ] --- ## Looking at Florida .pull-left[ ```r library(maps) states <- st_as_sf(maps::map('state', plot = F, fill = T)) %>% cbind(st_coordinates(st_centroid(.))) %>% mutate(ID = str_to_title(ID)) ggplot(data = world)+ geom_sf() + geom_sf(data = states, fill = NA) + geom_text(data = states, aes(X, Y, label = ID), size = 5) + coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = F) ``` ] .pull-right[ <!-- --> ] --- ## Looking at Florida .pull-left[ ```r library(maps) states <- st_as_sf(map('state', plot = F, fill = T)) %>% cbind(st_coordinates(st_centroid(states))) %>% mutate(ID = str_to_title(ID)) ggplot(data = world)+ geom_sf() + geom_sf(data = states, fill = NA) + * geom_label(data = states, aes(X, Y, label = ID), size = 5) + coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = F) ``` ] .pull-right[ <!-- --> ] --- ## Looking at the Florida elections .pull-left[ ```r source('data/florida.R') head(florida_election) florida_election <- florida_election %>% mutate_at(vars(ends_with('perc')), ~as.numeric(str_remove(., '%'))) ``` ] .pull-right[ ``` #> # A tibble: 6 x 14 #> County Bush Bush_perc Gore Gore_perc Nader Nader_perc Buchanan #> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> #> 1 Alach… 34135 39.80% 47380 55.25% 3228 3.76% 263 #> 2 Baker 5611 68.80% 2392 29.33% 53 0.65% 73 #> 3 Bay 38682 65.70% 18873 32.06% 830 1.41% 248 #> 4 Bradf… 5416 62.43% 3075 35.45% 84 0.97% 65 #> 5 Breva… 115253 52.75% 97341 44.55% 4471 2.05% 571 #> 6 Browa… 177939 30.93% 387760 67.41% 7105 1.24% 795 #> # … with 6 more variables: Buchanan_perc <chr>, Other <dbl>, #> # Other_perc <chr>, Margin <dbl>, Margin_perc <chr>, Total <dbl> ``` ] --- ## Looking at the Florida election .pull-left[ Now we need the map information ```r library(maps) counties <- st_as_sf(maps::map('county', plot = F, fill = T)) head(counties) counties <- counties %>% filter(str_detect(ID, 'florida')) counties <- counties %>% separate(ID, c('State','County'), sep = ',') %>% mutate_at(vars(State:County), str_to_title) ``` ] .pull-right[ ``` #> Simple feature collection with 6 features and 1 field #> geometry type: MULTIPOLYGON #> dimension: XY #> bbox: xmin: -88.01778 ymin: 30.24071 xmax: -85.06131 ymax: 34.2686 #> epsg (SRID): 4326 #> proj4string: +proj=longlat +datum=WGS84 +no_defs #> geometry ID #> 1 MULTIPOLYGON (((-86.50517 3... alabama,autauga #> 2 MULTIPOLYGON (((-87.93757 3... alabama,baldwin #> 3 MULTIPOLYGON (((-85.42801 3... alabama,barbour #> 4 MULTIPOLYGON (((-87.02083 3... alabama,bibb #> 5 MULTIPOLYGON (((-86.9578 33... alabama,blount #> 6 MULTIPOLYGON (((-85.66866 3... alabama,bullock ``` ] --- ## Looking at the Florida election .pull-left[ The nice thing about the `sf` package is that it renders all the data into a data frame, so adding to it, or merging new data becomes easy. We will now merge the election data with the map data ```r election_by_county <- counties %>% left_join(florida_election) head(election_by_county) ``` ] .pull-right[ ``` #> Simple feature collection with 6 features and 15 fields #> geometry type: MULTIPOLYGON #> dimension: XY #> bbox: xmin: -85.98951 ymin: 25.94926 xmax: -80.08804 ymax: 30.57303 #> epsg (SRID): 4326 #> proj4string: +proj=longlat +datum=WGS84 +no_defs #> State County Bush Bush_perc Gore Gore_perc Nader Nader_perc #> 1 Florida Alachua 34135 39.80 47380 55.25 3228 3.76 #> 2 Florida Baker 5611 68.80 2392 29.33 53 0.65 #> 3 Florida Bay 38682 65.70 18873 32.06 830 1.41 #> 4 Florida Bradford 5416 62.43 3075 35.45 84 0.97 #> 5 Florida Brevard 115253 52.75 97341 44.55 4471 2.05 #> 6 Florida Broward 177939 30.93 387760 67.41 7105 1.24 #> Buchanan Buchanan_perc Other Other_perc Margin Margin_perc Total #> 1 263 0.31 751 0.88 -13245 -15.44 85757 #> 2 73 0.90 26 0.32 3219 39.47 8155 #> 3 248 0.42 243 0.41 19809 33.65 58876 #> 4 65 0.75 35 0.40 2341 26.99 8675 #> 5 571 0.26 852 0.39 17912 8.20 218488 #> 6 795 0.14 1640 0.29 -209821 -36.48 575239 #> geometry #> 1 MULTIPOLYGON (((-82.66062 2... #> 2 MULTIPOLYGON (((-82.04182 3... #> 3 MULTIPOLYGON (((-85.40509 3... #> 4 MULTIPOLYGON (((-82.4257 29... #> 5 MULTIPOLYGON (((-80.94747 2... #> 6 MULTIPOLYGON (((-80.89018 2... ``` ] --- ## Looking at the Florida election .pull-left[ Now we're ready to plot ```r ggplot(election_by_county) + geom_sf(aes(fill = Gore_perc)) + scale_fill_viridis_c(option = 'plasma') + labs(caption = 'Source: Wikipedia') ``` ] .pull-right[ <!-- --> ] --- ## Looking at the Florida election .pull-left[ We can clean this up a bit, and add surrounding states ```r (plt_map <- ggplot(data = world)+ geom_sf() + geom_sf(data = states, fill = NA) + geom_sf(data = election_by_county, aes(fill=Gore_perc)) + * geom_label(data = states %>% filter(ID != 'Florida'), aes(X, Y, label = ID), size = 5) + coord_sf(xlim = c(-88, -78), ylim = c(24.5, 33), expand = F) + labs(x = '', y = '', fill = 'Percentage for Gore') + scale_fill_viridis_c(option = 'plasma')) ``` ] .pull-right[ <!-- --> ] --- class: middle,center # Genomic data --- ## Visualizing a proteomic network .pull-left[ We read a dataset that contains the network relationships between different proteins ```r library(ggnetwork) datf <- rio::import('data/string_graph.txt') head(datf) ``` ] .pull-right[ ``` #> node1 node2 node1_string_id node2_string_id node1_external_id #> 1 CXCR3 CCR7 1855969 1843829 ENSP00000362795 #> 2 ITGA4 EED 1858446 1845338 ENSP00000380227 #> 3 SMC3 CENPK 1854200 1843648 ENSP00000354720 #> 4 HNRNPA1 LUC7L3 1852510 1843556 ENSP00000341826 #> 5 SMC2 RB1 1847012 1845924 ENSP00000286398 #> 6 RBBP4 CENPK 1855919 1843648 ENSP00000362592 #> node2_external_id neighborhood fusion cooccurence homology coexpression #> 1 ENSP00000246657 0 0 0 0.847 0.000 #> 2 ENSP00000263360 0 0 0 0.000 0.000 #> 3 ENSP00000242872 0 0 0 0.000 0.000 #> 4 ENSP00000240304 0 0 0 0.000 0.000 #> 5 ENSP00000267163 0 0 0 0.000 0.136 #> 6 ENSP00000242872 0 0 0 0.000 0.000 #> experimental knowledge textmining combined_score #> 1 0.000 0.9 0.878 0.913 #> 2 0.566 0.0 0.312 0.688 #> 3 0.000 0.9 0.081 0.904 #> 4 0.309 0.0 0.394 0.563 #> 5 0.000 0.9 0.097 0.915 #> 6 0.000 0.9 0.046 0.900 ``` ] --- ## Visualizing a proteomic network .pull-left[ The `igraph` package allows the creation of network graphs. However, here, we're only using it for data ingestion ```r grs <- igraph::graph_from_data_frame(datf[,c('node1','node2')], directed = F) grs ``` We see that this object holds the different connections. ] .pull-right[ ``` #> IGRAPH 5a60f1b UN-- 37 58 -- #> + attr: name (v/c) #> + edges from 5a60f1b (vertex names): #> [1] CXCR3 --CCR7 ITGA4 --EED SMC3 --CENPK HNRNPA1--LUC7L3 #> [5] SMC2 --RB1 RBBP4 --CENPK CXCR5 --CXCL13 CD44 --RUNX2 #> [9] CXCR5 --PF4 PF4 --THBD SMARCA1--EZH2 HMMR --BARD1 #> [13] MBP --MMP7 CCL19 --CCR7 RBBP4 --EZH2 RUNX2 --RB1 #> [17] RB1 --HSPA8 DHX9 --BARD1 CXCL13 --CCR7 SMC2 --KIF23 #> [21] CD44 --HMMR ITGA4 --CD44 RB1 --SMARCE1 ITGA4 --CCR7 #> [25] MBP --STK4 RBBP4 --LIN9 RB1 --EED CXCR5 --CCR7 #> [29] PSMA1 --HSPA8 RBBP4 --SMARCA1 CXCR3 --ITGA4 MBP --CDK12 #> + ... omitted several edges ``` ] --- ## Visualizing a proteomic network .pull-left[ We can then transform this data into `ggplot`-friendly data, to use `ggplot` for the plotting ```r ggdf <- ggnetwork(grs, layout = 'fruchtermanreingold') ggplot(ggdf, aes(x = x, y = y, xend = xend, yend = yend)) + * geom_edges(color = "black", * curvature = 0.1, size = 0.95, alpha = 0.8)+ * geom_nodes(aes(x = x, y = y), * size = 3, * alpha = 0.5, * color = "orange") + * geom_nodelabel_repel(aes(label = vertex.names), * size=4, color="#8856a7") + theme_blank() + theme(legend.position = "none") ``` ] .pull-right[ <!-- --> ] --- class: middle, center # Composing different genomic data into tracks --- ## The `ggbio` package The `ggbio` package has several functions that allow graphical representations of different genomic entities. .pull-left[ An ideogram ```r library(ggbio) p.ideo <- Ideogram(genome = 'hg19') p.ideo ``` ] .pull-right[ <!-- --> ] --- ## The `ggbio` package .pull-left[ Visualizing the gene model ```r library(Homo.sapiens) data(genesymbol, package='biovizBase') wh <- genesymbol[c('BRCA1','NBR1')] wh <- range(wh, ignore.strand=T) p.txdb <- autoplot(Homo.sapiens, which = wh) p.txdb ``` ] .pull-right[ <!-- --> ] --- ## The `ggbio` package .pull-left[ Putting it into tracks ```r library(GenomicRanges) gr17 <- GRanges("chr17", IRanges(41234415, 41234569)) tks <- tracks(p.ideo, gene = p.txdb) + xlim(gr17) tks + theme_tracks_sunset() ``` ] .pull-right[ <!-- --> ] --- class: middle, center # Interactive maps --- ## Using plotly .pull-left[ ```r library(plotly) ggplotly(plt_map) ``` ] .pull-right[
] --- ## Using real maps .pull-left[ ```r library(leaflet) m <- leaflet() %>% addTiles() %>% # Add default OpenStreetMap map tiles addMarkers(lng=174.768, lat=-36.852, popup="The birthplace of R") m ``` ] .pull-right[
] --- class: center, middle # Heatmaps --- ## Recall our heatmap .pull-left[ ```r library(Biobase) #data(sample.ExpressionSet) exdat <- readRDS('data/exprset.rds') library(limma) design1 <- model.matrix(~type, data=pData(exdat)) lm1 <- lmFit(exprs(exdat), design1) lm1 <- eBayes(lm1) # compute linear model for each probeset geneID <- rownames(topTable(lm1, coef = 2, number = 100, adjust.method = 'none', p.value = 0.05)) exdat2 <- exdat[geneID,] # Keep features with p-values < 0.05 library(Heatplus) reg1 <- regHeatmap(exprs(exdat2), legend=2, col=heat.colors, breaks=-3:3) plot(reg1) ``` ] .pull-right[ <!-- --> ] --- ## Using `heatmaply` .pull-left[ ```r library(heatmaply) heatmaply(exprs(exdat2)) ``` ] .pull-right[
] --- ## Using heatmaply .pull-left[ You can also easily use the correlation metric ```r heatmaply(exprs(exdat2), distfun = 'pearson') ``` ] .pull-right[
] --- ## Using `d3heatmap` .pull-left[ ```r library(d3heatmap) d3heatmap(exprs(exdat2)) ``` ] .pull-right[
] --- class: middle, center # Manhattan plots --- ```r *library(qqman) manhattan(gwasResults) ``` <!-- --> --- ```r library(manhattanly) manhattanly(gwasResults) ```
--- ## Interactions using plotly - Plot.ly is a web service that produces interactive graphics from data - They made their backend open-source - In R, you can interact with plot.ly using the package `plotly`. --- ## Interactions using plotly - Plotly makes it very easy to create interactive plots based on ggplot .pull-left[ ```r plt_corrupt ``` ] .pull-right[ <!-- --> ] --- ## Interactions using plotly - Plotly makes it very easy to create interactive plots based on ggplot .pull-left[ ```r library(plotly) ggplotly(plt_corrupt) ``` ] .pull-right[
]